Gradient flow dynamics
Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs
The training of neural networks by gradient descent methods is a cornerstone of the deep learning revolution. Yet, despite some recent progress, a complete theory explaining its success is still missing. This article presents, for orthogonal input vectors, a precise description of the gradient flow dynamics of training one-hidden-layer ReLU neural networks for the mean squared error at small initialisation. In this setting, despite non-convexity, we show that the gradient flow converges to zero loss and characterise its implicit bias towards minimum variation norm. Furthermore, some interesting phenomena are highlighted: a quantitative description of the initial alignment phenomenon and a proof that the process follows a specific saddle-to-saddle dynamics.
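A minimal numerical sketch of this setting (dimensions, width, and step size are assumptions, and small-step gradient descent stands in for gradient flow): a one-hidden-layer ReLU network trained on orthogonal inputs from a small initialisation, printing the loss along the way.

```python
import numpy as np

# Sketch: one-hidden-layer ReLU network, square loss, orthogonal inputs, small
# initialisation. Plain gradient descent with a small step is used as a proxy
# for gradient flow; all sizes and step counts below are illustrative choices.
rng = np.random.default_rng(0)
d, n, m = 8, 8, 50                 # input dimension, samples, hidden width
X = np.eye(d)[:n]                  # orthogonal inputs
y = rng.standard_normal(n)         # arbitrary targets

scale = 1e-4                       # small initialisation
W = scale * rng.standard_normal((m, d))   # hidden-layer weights
a = scale * rng.standard_normal(m)        # output weights

eta, steps = 1e-2, 200_000
for t in range(steps):
    pre = X @ W.T                  # pre-activations, shape (n, m)
    act = np.maximum(pre, 0.0)     # ReLU activations
    r = act @ a - y                # residuals
    grad_a = act.T @ r / n
    grad_W = ((r[:, None] * (pre > 0)) * a).T @ X / n
    a -= eta * grad_a
    W -= eta * grad_W
    if t % 20_000 == 0:
        print(f"step {t:6d}   loss {0.5 * np.mean(r ** 2):.3e}")
```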
The emergence of sparse attention: impact of data distribution and benefits of repetition
Zucchet, Nicolas, d'Angelo, Francesco, Lampinen, Andrew K., Chan, Stephanie C. Y.
Emergence is a fascinating property of large language models and neural networks more broadly: as models scale and train for longer, they sometimes develop new abilities in sudden ways. Despite initial studies, we still lack a comprehensive understanding of how and when these abilities emerge. To address this gap, we study the emergence over training of sparse attention, a critical and frequently observed attention pattern in Transformers. By combining theoretical analysis of a toy model with empirical observations on small Transformers trained on a linear regression variant, we uncover the mechanics driving sparse attention emergence and reveal that emergence timing follows power laws based on task structure, architecture, and optimizer choice. We additionally find that repetition can greatly speed up emergence. Finally, we confirm these results on a well-studied in-context associative recall task. Our findings provide a simple, theoretically grounded framework for understanding how data distributions and model design influence the learning dynamics behind one form of emergence.
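A toy stand-in for the mechanism (an illustration, not the paper's model): a single softmax attention pattern trained to copy the value of one designated token stays nearly uniform on a long loss plateau and then collapses onto that token in a sudden, sigmoid-like transition; in this toy, the plateau lengthens as the number of tokens grows.

```python
import numpy as np

# Toy model of sparse-attention emergence (an illustration, not the paper's
# setup): logits z over T tokens, attention p = softmax(z), and a squared loss
# for copying the value of token 0. Gradient descent shows a plateau followed
# by a sudden collapse of attention onto token 0.
rng = np.random.default_rng(0)
T = 32                              # number of tokens (assumed)
z = np.zeros(T)                     # attention logits, start uniform
eta, steps, batch = 0.5, 4000, 256

for t in range(steps):
    v = rng.standard_normal((batch, T))       # random token values
    e = np.exp(z - z.max()); p = e / e.sum()  # softmax attention
    err = v @ p - v[:, 0]                     # prediction minus target value
    grad_p = (err[:, None] * v).mean(axis=0)  # d(loss)/dp
    grad_z = p * (grad_p - p @ grad_p)        # softmax Jacobian applied to grad_p
    z -= eta * grad_z
    if t % 500 == 0:
        print(f"step {t:5d}   attention on token 0: {p[0]:.3f}   "
              f"loss {0.5 * np.mean(err ** 2):.3f}")
```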
On the Neural Feature Ansatz for Deep Neural Networks
Tansley, Edward, Massart, Estelle, Cartis, Coralia
Understanding feature learning is an important open question in establishing a mathematical foundation for deep neural networks. The Neural Feature Ansatz (NFA) states that after training, the Gram matrix of the first-layer weights of a deep neural network is proportional to some power $α > 0$ of the average gradient outer product (AGOP) of this network with respect to its inputs. Assuming gradient flow dynamics with balanced weight initialization, the NFA was proven to hold throughout training for two-layer linear networks with exponent $α = 1/2$ (Radhakrishnan et al., 2024). We extend this result to networks with $L \geq 2$ layers, showing that the NFA holds with exponent $α = 1/L$, thus demonstrating a depth dependency of the NFA. Furthermore, we prove that for unbalanced initialization, the NFA holds asymptotically through training if weight decay is applied. We also provide counterexamples showing that the NFA does not hold for some network architectures with nonlinear activations, even when these networks fit the training data arbitrarily well. We thoroughly validate our theoretical results through numerical experiments across a variety of optimization algorithms, weight decay rates, and initialization schemes.
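A small numerical check for deep linear networks under simplifying assumptions (balanced initialization taken as equal scaled copies of one orthogonal matrix, small-step gradient descent as a proxy for gradient flow, arbitrary dimensions): after training, the first-layer Gram matrix $W_1^T W_1$ should be nearly proportional to the $1/L$ power of the AGOP.

```python
import numpy as np

# Sketch: train an L-layer linear network with a balanced initialization
# (all layers equal to a scaled orthogonal matrix) by gradient descent, then
# compare the first-layer Gram matrix with the 1/L power of the AGOP. All
# dimensions, scales and step sizes are illustrative assumptions.
rng = np.random.default_rng(0)
d, n, L = 6, 100, 3
X = rng.standard_normal((d, n))
W_true = rng.standard_normal((d, d))
Y = W_true @ X                                    # realizable linear targets

def prod(mats):                                   # left-to-right matrix product
    out = np.eye(d)
    for M in mats:
        out = out @ M
    return out

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
Ws = [0.4 * Q.copy() for _ in range(L)]           # balanced: W_{i+1}^T W_{i+1} = W_i W_i^T

eta = 0.02
for _ in range(50_000):
    P = prod(Ws[::-1])                            # end-to-end map W_L ... W_1
    E = (P @ X - Y) / n
    grads = []
    for i in range(L):
        below = prod(Ws[:i][::-1])                # product of layers before Ws[i]
        above = prod(Ws[i + 1:][::-1])            # product of layers after Ws[i]
        grads.append(above.T @ E @ X.T @ below.T)
    for i in range(L):
        Ws[i] -= eta * grads[i]

P = prod(Ws[::-1])
agop = P.T @ P                                    # the input Jacobian is P for every input
evals, evecs = np.linalg.eigh(agop)
agop_pow = evecs @ np.diag(np.clip(evals, 0.0, None) ** (1.0 / L)) @ evecs.T
gram = Ws[0].T @ Ws[0]
cos = np.sum(gram * agop_pow) / (np.linalg.norm(gram) * np.linalg.norm(agop_pow))
print(f"cosine similarity between W1^T W1 and AGOP^(1/L): {cos:.4f}")
```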
Convergence Properties of Natural Gradient Descent for Minimizing KL Divergence
The Kullback-Leibler (KL) divergence plays a central role in probabilistic machine learning, where it commonly serves as the canonical loss function. Optimization in such settings is often performed over the probability simplex, where the choice of parameterization significantly impacts convergence. In this work, we study the problem of minimizing the KL divergence and analyze the behavior of gradient-based optimization algorithms under two dual coordinate systems within the framework of information geometry: the exponential family ($θ$ coordinates) and the mixture family ($η$ coordinates). We compare Euclidean gradient descent (GD) in these coordinates with the coordinate-invariant natural gradient descent (NGD), where the natural gradient is a Riemannian gradient that incorporates the intrinsic geometry of the underlying statistical model. In continuous time, we prove that the convergence rates of GD in the $θ$ and $η$ coordinates provide lower and upper bounds, respectively, on the convergence rate of NGD. Moreover, under affine reparameterizations of the dual coordinates, the convergence rates of GD in $η$ and $θ$ coordinates can be scaled to $2c$ and $\frac{2}{c}$, respectively, for any $c>0$, while NGD maintains a fixed convergence rate of $2$, remaining invariant to such transformations and sandwiched between them. Although this suggests that NGD may not exhibit uniformly superior convergence in continuous time, we demonstrate that its advantages become pronounced in discrete time, where it achieves faster convergence and greater robustness to noise, outperforming GD. Our analysis hinges on bounding the spectrum and condition number of the Hessian of the KL divergence at the optimum, which coincides with the Fisher information matrix.
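A small numerical sketch of the comparison on a four-state simplex (the target distribution, step size, and iteration count are assumptions, and this is not the paper's experiment): gradient descent in $θ$ coordinates, gradient descent in $η$ coordinates, and natural gradient descent preconditioned by the pseudo-inverse Fisher information.

```python
import numpy as np

# Sketch: minimize KL(q || p) over the probability simplex with (i) GD in
# exponential-family (theta) coordinates, (ii) GD in mixture-family (eta)
# coordinates, and (iii) natural gradient descent, then compare the final KL.
def kl(q, p):
    return np.sum(q * (np.log(q) - np.log(p)))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

q = np.array([0.5, 0.25, 0.15, 0.10])       # target distribution (assumed)
K, lr, steps = len(q), 0.1, 200

theta_gd = np.zeros(K)                       # GD in theta coordinates
theta_ngd = np.zeros(K)                      # NGD in theta coordinates
eta = np.full(K - 1, 1.0 / K)                # GD in eta = (p_1, ..., p_{K-1})

for t in range(steps):
    # GD in theta: the Euclidean gradient of KL(q || softmax(theta)) is p - q
    p = softmax(theta_gd)
    theta_gd -= lr * (p - q)

    # NGD in theta: precondition by the (pseudo-)inverse Fisher information
    p = softmax(theta_ngd)
    fisher = np.diag(p) - np.outer(p, p)
    theta_ngd -= lr * np.linalg.pinv(fisher) @ (p - q)

    # GD in eta: free coordinates are the first K-1 probabilities
    p = np.append(eta, 1.0 - eta.sum())
    grad_eta = -q[:-1] / p[:-1] + q[-1] / p[-1]
    eta = np.clip(eta - lr * grad_eta, 1e-6, None)

p_gd, p_ngd = softmax(theta_gd), softmax(theta_ngd)
p_eta = np.append(eta, 1.0 - eta.sum())
print(f"KL after {steps} steps:  GD-theta {kl(q, p_gd):.2e}   "
      f"NGD {kl(q, p_ngd):.2e}   GD-eta {kl(q, p_eta):.2e}")
```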
Towards Understanding Gradient Flow Dynamics of Homogeneous Neural Networks Beyond the Origin
Recent works exploring the training dynamics of homogeneous neural network weights under gradient flow with small initialization have established that in the early stages of training, the weights remain small and near the origin, but converge in direction. Building on this, the current paper studies the gradient flow dynamics of homogeneous neural networks with locally Lipschitz gradients, after they escape the origin. Insights gained from this analysis are used to characterize the first saddle point encountered by gradient flow after escaping the origin. Also, it is shown that for homogeneous feed-forward neural networks, under certain conditions, the sparsity structure emerging among the weights before the escape is preserved after escaping the origin and until reaching the next saddle point.
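A small empirical companion to this picture (assumed sizes and step size, with gradient descent standing in for gradient flow): the script below tracks the total weight norm and the number of well-separated neuron directions of a shallow ReLU network trained from a tiny initialization; the count typically drops to a few clusters early on, illustrating the sparsity structure referred to above.

```python
import numpy as np

# Sketch: shallow ReLU network from a tiny initialization. Track the weight
# norm and the number of well-separated directions among the non-negligible
# neurons. Sizes, thresholds and step size are illustrative assumptions.
rng = np.random.default_rng(1)
d, n, m = 10, 12, 40
X = rng.standard_normal((n, d))
y = np.sign(rng.standard_normal(n))

W = 1e-6 * rng.standard_normal((m, d))
a = 1e-6 * rng.standard_normal(m)
eta = 1e-2
for t in range(60_001):
    pre = X @ W.T
    act = np.maximum(pre, 0.0)
    r = act @ a - y
    grad_a = act.T @ r / n
    grad_W = ((r[:, None] * (pre > 0)) * a).T @ X / n
    a -= eta * grad_a
    W -= eta * grad_W
    if t % 10_000 == 0:
        norms = np.linalg.norm(W, axis=1)
        active = W[norms > 0.05 * norms.max()]          # ignore (near-)dead neurons
        dirs = active / np.linalg.norm(active, axis=1, keepdims=True)
        reps = []                                        # greedy grouping of near-parallel neurons
        for v in dirs:
            if not any(v @ u > 0.99 for u in reps):
                reps.append(v)
        print(f"step {t:6d}   ||W||_F = {np.linalg.norm(W):.2e}   "
              f"distinct directions ~ {len(reps)}")
```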
Stability properties of gradient flow dynamics for the symmetric low-rank matrix factorization problem
Mohammadi, Hesameddin, Tinati, Mohammad, Tu, Stephen, Soltanolkotabi, Mahdi, Jovanović, Mihailo R.
The symmetric low-rank matrix factorization serves as a building block in many learning tasks, including matrix recovery and training of neural networks. However, despite a flurry of recent research, the dynamics of its training via non-convex factorized gradient-descent-type methods is not fully understood, especially in the over-parameterized regime where the fitted rank is higher than the true rank of the target matrix. To overcome this challenge, we characterize equilibrium points of the gradient flow dynamics and examine their local and global stability properties. To facilitate a precise global analysis, we introduce a nonlinear change of variables that brings the dynamics into a cascade connection of three subsystems whose structure is simpler than the structure of the original system. We demonstrate that the Schur complement to a principal eigenspace of the target matrix is governed by an autonomous system that is decoupled from the rest of the dynamics. In the over-parameterized regime, we show that this Schur complement vanishes at an $O(1/t)$ rate, thereby capturing the slow dynamics that arises from excess parameters. We utilize a Lyapunov-based approach to establish exponential convergence of the other two subsystems. By decoupling the fast and slow parts of the dynamics, we offer new insight into the shape of the trajectories associated with local search algorithms and provide a complete characterization of the equilibrium points and their global stability properties. Such an analysis via nonlinear control techniques may prove useful in several related over-parameterized problems.
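An Euler discretization sketching this gradient flow in the over-parameterized regime (dimensions, ranks, step size, and initialization scale are assumptions): the residual $\|XX^\top - M\|_F$ falls quickly at first and then decays slowly, consistent with the fast/slow split described above.

```python
import numpy as np

# Euler discretization of the gradient flow for f(X) = 0.25*||X X^T - M||_F^2
# with fitted rank p larger than the true rank r (over-parameterized). Sizes,
# step size and initialization scale are illustrative assumptions.
rng = np.random.default_rng(0)
n, r, p = 20, 3, 6
U = rng.standard_normal((n, r))
M = U @ U.T / n                           # PSD target of rank r

X = 1e-3 * rng.standard_normal((n, p))    # small initialization
dt, steps = 0.05, 100_000
for t in range(1, steps + 1):
    R = X @ X.T - M
    X -= dt * (R @ X)                     # dX/dt = -(X X^T - M) X
    if t % 20_000 == 0:
        print(f"time {t * dt:8.1f}   ||XX^T - M||_F = {np.linalg.norm(R):.3e}")
```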
From exponential to finite/fixed-time stability: Applications to optimization
Ozaslan, Ibrahim K., Jovanović, Mihailo R.
The development of finite/fixed-time stable optimization algorithms typically involves the study of specific problem instances. The lack of a unified framework hinders understanding of more sophisticated algorithms, e.g., primal-dual gradient flow dynamics. The purpose of this paper is to address the following question: Given an exponentially stable optimization algorithm, can it be modified to obtain a finite/fixed-time stable algorithm? We provide an affirmative answer, demonstrate how the solution can be computed on a finite-time interval via a simple scaling of the right-hand side of the original dynamics, and certify the desired properties of the modified algorithm using the Lyapunov function that proves exponential stability of the original system. Finally, we examine nonsmooth composite optimization problems and smooth problems with linear constraints to demonstrate the merits of our approach.
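A toy illustration of the idea on a strongly convex quadratic (the scaling $c/(T-t)$ is a standard time-rescaling device, used here as an assumption rather than the paper's exact construction): the exponentially stable gradient flow is sped up so that it reaches the minimizer by the terminal time $T$.

```python
import numpy as np

# Toy example: f(x) = 0.5 x^T A x is strongly convex, so dx/dt = -A x converges
# exponentially. Scaling the right-hand side by c/(T - t) (a standard time
# rescaling, assumed here for illustration) drives x to the minimizer x* = 0
# as t approaches the finite terminal time T.
rng = np.random.default_rng(0)
A = np.diag([1.0, 2.0, 4.0])
x = rng.standard_normal(3)

T, c, dt = 1.0, 2.0, 1e-4
t = 0.0
while t < T - 10 * dt:                    # forward Euler, stopping just short of T
    x -= dt * (c / (T - t)) * (A @ x)     # scaled dynamics: dx/dt = -(c/(T-t)) A x
    t += dt
print(f"t = {t:.4f}   distance to minimizer: {np.linalg.norm(x):.2e}")
```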
Accelerated forward-backward and Douglas-Rachford splitting dynamics
Ozaslan, Ibrahim K., Jovanović, Mihailo R.
We examine convergence properties of continuous-time variants of accelerated Forward-Backward (FB) and Douglas-Rachford (DR) splitting algorithms for nonsmooth composite optimization problems. When the objective function is given by the sum of a quadratic and a nonsmooth term, we establish accelerated sublinear and exponential convergence rates for convex and strongly convex problems, respectively. Moreover, for FB splitting dynamics, we demonstrate that the accelerated exponential convergence rate carries over to general strongly convex problems. In our Lyapunov-based analysis we exploit the variable-metric gradient interpretations of FB and DR splittings to obtain smooth Lyapunov functions that allow us to establish accelerated convergence rates. We provide computational experiments to demonstrate the merits and the effectiveness of our analysis.
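A sketch of the non-accelerated baseline on a quadratic-plus-$\ell_1$ problem (problem data and step sizes are assumptions; the accelerated variants studied in the paper add momentum terms not shown here): forward Euler discretization of a common continuous-time form of FB splitting, $\dot{x} = \mathrm{prox}_{\mu g}(x - \mu \nabla f(x)) - x$.

```python
import numpy as np

# Sketch: continuous-time forward-backward splitting for f(x) + g(x) with
# f(x) = 0.5*||Ax - b||^2 and g(x) = lam*||x||_1, discretized by forward Euler.
# Problem sizes, regularization and step sizes are illustrative assumptions.
def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(0)
m, n, lam = 30, 60, 0.1
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n); x_true[:5] = rng.standard_normal(5)   # sparse ground truth
b = A @ x_true

mu = 1.0 / np.linalg.norm(A, 2) ** 2       # forward step size
x = np.zeros(n)
dt = 0.5                                    # Euler step for the continuous dynamics
for k in range(2000):
    grad = A.T @ (A @ x - b)
    x += dt * (soft_threshold(x - mu * grad, mu * lam) - x)
    if k % 400 == 0:
        obj = 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum()
        print(f"iter {k:4d}   objective {obj:.4f}")
```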